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ABSTRACT 


With the growing use of data in all aspeets of government and eommeree, the need for 
that data to be both aeeessible and seeure is also growing. One solution to this dual need 
is provided by Aeeumulo, a database that allows multiple users of various seeurity levels 
to aeeess one platform but reeeive authorization to view only portions of the database. 
Various databases, however, organize information differently. This thesis examines the 
possibility of implementing a granular seeurity on a graph database. Using Neo4j as a 
referenee implementation, graph theory eoneepts are used to find a method of allowing 
data aeeess while retaining seeurity in a data environment that emphasizes eonneetivity. 
Using adjaeeney matrix multiplieation on bipartite graph sliees of the network of seeurity 
layers, a mathematieal justifieation exists for loeating two step eonneetions that exit from 
and return to a seeurity layer. These eonneetions ean be revealed to a user without granting 
aeeess outside of the assigned seeurity layer. 
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CHAPTER 1: 

Introduction 


Network science and graph theory are relative newcomers to the worlds of mathematics and 
computer science, but their emergence has highlighted that "most events and phenomena 
are connected, caused by, and interacting with a huge number of other pieces of a com¬ 
plex universal puzzle" [1]. The analysis of these connections reveals information about 
the world that could not have been discovered if those same pieces of information were 
examined in isolation. 

Databases have long served as a tool where data can be both stored and linked. Starting in 
the mid-1980s, database designers were able to decouple the manipulation of information 
from the computer’s storage processes [2]. This split allowed database use to become 
more accessible to non-expert users who did not have extensive knowledge of programming 
code or computer architecture. These early computer database models, called relational 
databases, employed Structured Query Language (SQL) to allow for more human-readable 
and human-writable queries. 

SQL databases, however, encountered competition with the rise of spreadsheets, which in¬ 
cluded a more straightforward way to manage and manipulate small data sets, a task previ¬ 
ously possible only through the use of a relational database [2]. The prevalence of modern 
"big data" has seen a resurgence of the popularity of databases. These newer databases, 
collectively referred to as No-SQL databases, depart from the traditional relational model. 

Although they are lumped into one category, the data storage models of these No-SQL 
databases are highly varied. These various models allow data to be stored, searched, and 
analyzed in different ways. Frequently, these models also provide some advantage over a 
traditional relational model either in the speed with which the data can be searched or the 
complexity of query and analysis options. 

The differences between these models can be compared to finding a specific room number 
at two different hotels. The searcher’s approach to finding the room would be different 
for a high-rise in the city than it would be for a beach bungalow. The same is true with 
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data storage models; how the data is stored and searched varies according to the goals 
and designs of the data model. In particular, one type of new database model is constructed 
around the ideas of graph theory, which permits data to be organized in a way that leverages 
its inter-connectivity. 

With database security, it is typically assumed that any authorized user for the database 
should have access to the entire data set. With the size of databases steadily increasing, 
this blanket access may no longer be appropriate. Securing data within a database to allow 
users access only to a portion of the entire data set allows large databases to grow while 
imposing restrictions to prevent every user from having access to the entire database. This 
partitioned data access, called granular security, is not widely utilized. 

One database, Accumulo, successfully implements granular security control on a more 
common No-SQL data storage model. Based on Google’s BigTable database, Accumulo 
adds a security control feature to the underlying design. Users of different access levels 
can all be permitted to access the same Accumulo database, and the database system will 
return information to a given user only if it matches her assigned access level. 

The security restrictions of Accumulo provide a good model, but the organization of the 
data still relies on a design where data points, although they can be linked, are still largely 
isolated. Newer database designs based on a graph model, called graph databases, store 
information with more emphasis on how the individual data points are linked to each other. 
Implementing a granular security feature on a graph database would allow for finer access 
controls; doing so places restrictions on what data is accessible to a user and may, thus, 
limit the benefits that a graph database has for leveraging data connections. 

The goal of this thesis is to explore the feasibility of implementing granular security con¬ 
trols onto a graph database without sacrificing the graph’s ability to connect data. Neo4j, a 
current graph database software application, is used as the base template for implementing 
granular security on a graph data model in the same way that BigTable served as the base 
for the security controls of Accumulo. 

In Chapter 2, the concept of granular database security is further developed. Graph struc¬ 
tures and graph databases, specifically Neo4j, is discussed in Chapter 3. The basic elements 
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of implementing granular seeurity in a graph database and a mathematieal exploration of 
the ehallenges of that implementation is addressed in Chapters 4 and 5, respeetively. Chap¬ 
ter 6 explores the seeurity implementation on a generated graph, and Chapter 7 eoneludes 
this thesis and presents opportunities for future work. 
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CHAPTER 2: 
Security Models 


Implementing granular security on a database would not happen in a vacuum. An orga¬ 
nization or corporation desiring to exert finer controls on access within a database likely 
already has some sort of data security model in place. In a corporate context, these parti¬ 
tions may exist to keep sales, marketing, and manufacturing information separate for most 
employees, but managers might have access to multiple areas. In a government context, 
information is typically segregated into Unclassified, Secret, and Top Secret information. 
This chapter addresses a few generic security models and outlines the current government 
data-management model. It then describes the implementation of granular security on the 
government classification system. 


2.1 General Models 

In order for a granular security implementation to be successful, it must be built on a gen¬ 
eral access security policy that addresses not only what information can be read but also 
what information can be written. Information separation architectures can be hierarchi¬ 
cal, compartmentalized, or some combination of the two. Hierarchical denotes security 
structures where one security level can be considered higher or lower than another, and 
someone who has access to a given level also has access to any lower levels. For example, 
in a government agency, a Secret-level user would, by default, have access to Unclassified 
information because Unclassified is a lower security level than Secret. 

Compartmentalized, on the other hand, denotes security levels that reside on an equal plane; 
access to one compartment does not automatically grant access to another one. In a business 
setting, for instance, accounting and marketing departments operate independently, and 
access to one area does not, by default, grant access to another because the two levels exist 
in parallel with each other. 

In a hierarchical security environment, the Bell-LaPadula model is a set of common security 
properties developed for the Department of Defense (DOD). The model is commonly 
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summarized down to two simple rules: no read up and no write down [3]. These principles 
allow a user to read any information at his own security level or lower and to write any 
information at his own security level or higher. 

Other security models, such as the Biba model for access control, also exist. The Biba 
model, which emphasizes data integrity vice data security, is the reverse of the Bell- 
LaPadula model; its rules are frequently summarized: no read down and no write up [4]. 
Not only do these two models have differing objectives, they are also only applicable to a 
hierarchical security environment. The concepts of up and down do not exist in a compart¬ 
mentalized security structure where security levels are divided in parallel. 


2.2 Data Management 

DOD information is classified into one of the three security levels from least secure. Un¬ 
classified, to most secure. Top Secret. (Confidential has been intentionally omitted both 
for ease of discussion with only three access levels and because no current broadly reach¬ 
ing information system operates strictly at that level.) Each of the levels also contains 
compartments that partition in parallel rather than from top to bottom, making the system 
both hierarchical with levels and compartmentalized within those levels. The diagram in 
Figure 2.1 illustrates this concept. 

A security clearance could allow user access to information at all levels (Unclassified, Se¬ 
cret, and Top Secret), but not all the information at those levels is necessarily accessible 
to that user. The information at each level that the user can access depends on the com¬ 
partments to which she has access. Access to the Secret level with compartments 1 and 
3 will not grant a user access to compartment 2, even though it is also at the Secret level 
(compartment names used here are arbitrary). 

Of note, in the hierarchical structure, access to any level implies access to the lower lev¬ 
els. A user with Secret access, by default has Unclassified access, but that user may not 
necessarily have Top Secret access. All users have access at the Unclassified level. Com¬ 
partments do not function with the same implications; a user with access to compartment 
3 does not necessarily have access to any other compartments. In addition, a user may not 
have access to any compartments at a given level. For clarity, the term levels will be used 
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when referring to a hierarehieal elassifieation system, and eompartments will refer to data 
that is partitioned in parallel. 



Figure 2.1: Security levels diagram. The concentric circles represent hierarchical security levels; 
the gray boxes represent compartments within the security levels. A Secret user with access to 
compartments 1 and 3 would have access to everything red and green, but only the gray boxes 
labeled 1 and 3 (all other gray boxes would not be accessible, even if they are in the red and 
green circles). 


The previous deseription outlines the regulations for aeeess to information aeeording to 
elearance level, but the actual usage of this information may involve accessing physically 
separate systems. A Top Secret user likely will need to access a separate Secret and/or 
Unclassified system to collect all needed data, and even then, connections between the data 
at different levels must be inferred by the user. Having one system with a granular access 
control would enable Top Secret and Unclassified users to access the same information but 
with limitations on what information would be available to them. This composite system 
would also be able to link data from different levels rather than relying on the user to make 
those connections. 
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2.3 Granular Access 

The government model of information systems works well for keeping information from 
leaking from more secure to less secure levels of access; however, the limitation of these 
separate systems prevents data at one level from being easily linked to data at a different 
level. Under the current architecture, the only way to draw connections between informa¬ 
tion at the Secret and Top Secret levels is to replicate that information onto the Top Secret 
system. Extending the process of connecting information on different levels for a single 
data point would require replicating the entire Secret system on the physical Top Secret 
system, for instance, as well as updating regularly since the information is not static. 

If a user needs access to two separate compartments, replicating one system onto another is 
likely not the best solution since those with access to one compartment do not necessarily 
have access to both compartments. In this case, an additional physical system would need 
to be created to house the information from both compartments. These systems would also 
need to account for regular updates. 

If all the information were instead stored on a single physical system and each item of data 
was assigned an access level (and compartment if necessary), then all users or applications 
could have access to the same system. The database, when queried, would return informa¬ 
tion only for the levels and compartments to which the user has access. This granular access 
is not only less cumbersome than physically separate systems, it also allows users, with the 
proper access levels, to connect information points where a connection was previously not 
possible due to the physically separate systems. 

In order to explore granular security in a way that is more broadly applicable, the security 
model utilized in the present work will be simplified. In the illustrations, users will be 
assumed to have authorization to read and write only at their own access level. Generic, 
compartmentalized (rather than hierarchical) security levels of X, Y, and Z will be used. 
This more restrictive model removes ambiguity and confusion that can be caused by more 
complicated models. In addition, this simplified model can apply to either hierarchical or 
compartmentalized security environments that emphasize either data security or integrity. 
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CHAPTER 3: 
Graph Databases 


Relational databases, the most widely used database architecture and historically the only 
widely available database architecture option, are constructed as collections of cross- 
referenced tables. These databases are searched, typically using SQL, for information 
usually collating data from multiple tables into the query output. This storage structure has 
functioned well for several decades; however, with the prevalence of increasingly larger 
and more complex data sets, these relational databases have not been able to scale up. The 
time required for some types of SQL queries to traverse a large, modern database does not 
keep pace with the speed of other applications which depend on the results of these queries, 
causing a choke point in the processing time for modem applications. 

This SQL choke point has led developers to explore alternate means of storing and search¬ 
ing large data sets. Due to the historical prevalence of SQL in relational databases, these 
new architectures are collectively referred to as non-SQL or NoSQL databases. Classifying 
these new, disparate types of databases into categories is not simple or consistent. Some 
common categories include tabular, key-value store, column, document store, and graph 
databases. 


3.1 Graphs and Digraphs 

Graph databases find their roots in the con¬ 
cepts of graph theory; a graph is a group of 
edges and vertices where a vertex, or node, 
can exist independently, but an edge must 
connect two vertices. In the specific graph 
database implementation of Neo4j, the type 
of graph structure used is a directed graph, 
or digraph, where the edge between two 
vertices is one-way directed from one ver¬ 
tex to the other. A basic graph is illustrated in Figure 3.1 and digraphs are depicted 
in Figure 3.2. 




Figure 3.1: Basic graph example. 
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Mathematically, a digraph is the set of vertices, V, where V is non-empty, and the set of 
edges, E, consisting of ordered pairs of vertices. In a digraph, those ordered pairs convey 
directionality of an edge in addition to its existence. For two arbitrary vertices u and v, 
the ordered pairs (m,v) and iy,u) connect the same two vertices, but are oriented in opposite 
directions [5]. Figure 3.2(a) shows a directed edge from vertex A to vertex B indicating 
directionality of the edge that points from A to B, but not from B to A. Figure 3.2(b) is a 
separate and distinct graph where the edge points from B to A, but not the reverse. In order 
to enable pointing both from A to B and B to A, two directed edges between A and B, one 
in each direction as shown in Figure 3.2(c), are required. 



(a) A to B (b) B to A (c) Symmetrical 


Figure 3.2: Directed graph examples. 

Graphs are used for database models when "information about data interconnectivity or 
topology is more important, or as important, as the data itself" [6]. Data related to an entity 
can be stored in the node and related information can be found in how the edges connect the 
various nodes. Specific queries can be run against a graph database such as finding cycles 
or shortest paths [6]. In Neo4j, data is stored in both the nodes and edges, or relationships, 
in the form of labels and properties; this specific data storage structure can vary slightly 
across various graph database models, but the basic elements of nodes and edges are a 
constant. 


3.2 Nodes and Relationships 

In graph database models, nodes and relationships are essential elements where the nodes 
store data about an object and the edges convey data about relationships between ob¬ 
jects [6]. Neo4j employs four essential data constructs in its database architecture: nodes, 
relationships, properties, and labels. Nodes and relationships are the standard nodes and 
edges found in basic graphs, and in the case of Neo4j, the relationships are directed edges. 
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Labels and properties are assigned to nodes and relationships for classification and data 
storage, respectively [7]. 

For nodes, labels enable the grouping of nodes into categories of similar types; for example, 
a large database may have nodes classified as animal, vegetable, or mineral. It is also 
possible for nodes to carry more than one label. The details of each node are stored as 
properties. For a node for fox, which carries the label of animal, details would be items 
such as color (red), number of legs (4), and habitat (forest). 

Queries can be run against the database specifying not only nodes, but also any of the node 
properties. A query could be run against the graph database that broadly returns all nodes 
with the label animal or narrowed to a search that returns all nodes with the label animal 
and a habitat of forest, which would return all animal nodes in the database with the habitat 
property of forest. Conversely, a query could be run to return all animals that have a habitat 
property that is not forest. The label can be helpful for grouping nodes but is not a required 
part of a query; the database could be queried for anything with the color property of red, 
which would return the animal node: fox and, for instance, a vegetable node: tomato. 



Figure 3.3: Simple animal database. 


The relationships of a graph database also contain labels and properties. They are also 
identified by the start and end nodes they connect, making the database a directed graph. 
A relationship with the label eats could connect the fox node to a rabbit node with the 
property nutritional value: protein. 
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A relationship can connect any two nodes regardless of their labels; a rabbit node could 
be connected to a vegetable node carrot via an eats relationship. These relationships are 
unidirectional, so any reciprocal connections are best identified by two relationships, one in 
each direction. A playswith relationship from a rabbit node to a deer node would need to be 
accompanied by an additional playswith relationship that starts at deer and ends at rabbit. 
This reciprocal relationship represented in a graph database would resemble the symmetric 
graph in Figure 3.2(c). The graph of the database with fox, rabbit, and deer nodes can be 
found in Figure 3.3. 

The directionality of relationships in Neo4j is semantic in nature but does not prohibit a 
search of the graph database from traversing against the direction of the edges. Using 
the example of Figure 3.3, the eats relationship points from fox to rabbit because foxes 
eat rabbits, not the reverse. If I wanted to search this database to find what animals eat 
rabbits, I would be able to construct a query that will follow all of the eats relationships 
that terminate at rabbit backward to find the originating nodes. Although the graph can be 
searched counter to the direction of the relationships, the directionality of the relationships 
are firm (i.e., foxes eat rabbits, not vice versa). The directionality of the relationships, 
not the direction of traversal, is used in this thesis to better understand security in a graph 
database. 


3.3 Neo4j and Cypher 

In Neo4j queries of the graph database are executed using Cypher, a custom, "declarative, 
pattern-matching query language" [7]. This section is meant to be a brief introduction to 
Cypher with a focus on the basic language structure and commands used to implement and 
illustrate granular security; for a more comprehensive discussion of Cypher see references 
[7], [8], [9]. 

The primary commands in interacting with Neo4j are CREATE, MATCH, and RETURN. 
Other common commands with high utility that will not be discussed in-depth are WHERE, 
DEEETE, SET, EOREACH, and MERGE; these commands illustrate the breadth of 
Cypher, but are not necessary for understanding granular security. 

The RETURN command is typically used in conjunction with MATCH to instruct the 
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database to return the results of the query, as opposed to deleting or merging them instead; 
without the RETURN command, Cypher will match the requested query, but then not ac¬ 
tually display the results to the user. The MATCH and CREATE commands, although they 
are used for different purposes, follow the same syntax. MATCH is used to locate nodes 
and relationships in the database that correspond to the query criteria. CREATE generates a 
new node(s) and/or relationship(s) according to the structure of the command. A CREATE 
UNIQUE command can also be used to prevent duplication of data. 

The MATCH and CREATE commands can be as simple or complex as desired. A simple, 
empty node query would match and return the entire contents of a database: 

MATCH (a) 

RETURN a 


would match and return the entire contents of a database by finding all nodes and their 
associated edges. A similar query could also be constructed that queries edges instead of 
nodes: 

MATCH ()-[r]->() 

RETURN r 


This query would also return the entire contents of a database by returning all of the edges 
and their associated nodes. Cypher commands can also be accompanied by relationships, 
labels, and properties. A more complex instruction would take the form: 

CREATE (a:Labell {Propertyl: "data"}-) 

- [:RELATIUNSHIP {Property2: "quality"}]-> 

(b:Label2 {PropertyS: "info"}) 

This command would create two nodes with a relationship between them. The nodes would 
be of type Eabell and Eabel2, each with different property elements. The connection be¬ 
tween them is of the type REEATIONSHIP with its own property. Each component of 
the instruction can contain multiple or no labels or properties; the degree of specificity de¬ 
pends on what the user wishes to accomplish. The "a" and "b" in the command are local 
variables used within the context of the command to reference already specified nodes as 
in the MATCH and RETURN command above. 
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The fox and rabbit nodes from Section 3.2 can be used in a create command: 


CREATE (a:Animal {name: "fox", color: "red", legs: "4", habitat: "forest"}-) 
-[:EATS {nutritionalValue: "protein"}-]-> 

(b:Animal {name: "rabbit", color: "gray", legs: "4", habitat: "forest"}-) 

This command will create the fox and rabbit nodes, including their labels and properties, 
with an eats relationship from fox to rabbit with its property. Once that data is entered into 
the database, a query could be run to find out what animals are prey for other animals: 

MATCH (a:Animal) -[:EATS]-> (b:Animal) 

RETURN b 


This query will return the rabbit node (along with any other animal nodes that are eaten 
by other animals. Using the node labels here, prevents the database from returning other 
things that are eaten, such as plants. 

Queries can also be run against a Neo4j database with much more complexity; Cypher can 
find and match any pattern that exists within the database. For instance: 

MATCH (f:Animal) -[:PLAYSWITH]-> (a:Animal) -[:EATS]-> (b:Animal) 
-[:EATS]-> (c:Animal) -[:EATS]-> (d:Animal) 

RETURN f,a 

would return the any animals that are both high up in the food chain and have an animal 
that they play with; the query would also return the animal that is the playmate. A full 
graph database of all animals would likely return a human node (a in the query) because 
humans are at the top of the food chain and they play with other animals; a dog node (f in 
the query) would also be returned because humans play with dogs. 

Cypher has much more depth and complexity to offer beyond these simple illustrations, but 
the use of the CREATE, MATCH, and RETURN commands is sufficient for understanding 
the implementation of granular security. The implementation of that security relies on the 
use of the property elements for nodes and relationships. 
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CHAPTER 4: 

Granular Security on Graph Nodes 


Granular security in a graph database must address the two primary facets of graphs: nodes 
and relationships. Node security will be addressed in this chapter, and the more compli¬ 
cated edge security will be introduced in this chapter and then be examined in depth in 
Chapter 5. 



Figure 4.1: Simple manipulation graph. 


To examine the various graph structures, the simple graph displayed in Figure 4.1 will be 
used for illustrative purposes. This graph contains nine nodes and 12 relationships; three 
nodes will be classified at each security level (X, Y, and Z as discussed in Chapter 2). The 
graph does not represent any actual data set but is examined as though every node and every 
relationship contains data. 
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4.1 Assigning Security 

The nodes, being the principal elements of a graph database, must be directly assigned a 
security property according to the sensitivity of the data represented. These classifications 
determine which parts of the database a user can access. With no granular security on 
a graph database of Figure 4.1, a query for all the nodes would return exactly what is 
displayed. 




Figure 4.2: Query with node security. 


Because nodes are both necessary and sufficient elements of a graph (i.e., a graph can con¬ 
tain nodes and no edges, but a graph cannot contain only edges because they, by definition 
connect nodes), they must receive a security classification at the time they are created, and 
they must always retain some security designator. While the security classification can be 
changed, it cannot be deleted from the node. 



Figure 4.3: Connection not visible with node security. 

The same database with granular security implemented on the nodes would return a subset 
of the total graph according to the user’s assigned security level. A user with access to only 
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Y-level information eould query the database for all nodes, but the results of that seareh 
would be restrieted. As shown in Figure 4.2, it would return all Y-level nodes rather than 
all nodes. 

What Figure 4.2 does not show to the Y-level user are indireet eonneetions that exist when 
a relationship extends to a non-Y-level node and then baek to a Y-level node. The two-edge 
eonneetion between the Y1 and Y3 nodes via XI seen in Figure 4.3 is a link hidden from 
A Y-level user. While the data at the XI node lies outside of the user’s aeeess level, the 
eonneetion between the two Y-level nodes may be a valuable link that is not visible to the 
user using only node seeurity. 


4.2 Relationship Types 

The main goal of any database is to store data in sueh a way that a user ean aeeess that data 
in a way that is useful. A graph database has the added objeetive of showing eonneetions 
between data points the user might not be able diseem in another data storage strueture. In 
the ease of Neo4j, those relationships can also store data that expands on the nature of the 
relationship between two nodes. 

The limitation of granular security on a graph database is that it shows only relationships 
in the subgraphs of a node set at a given security level; connections between data points 
in different security compartments would not be visible to users operating one of those 
compartments, but not both. In a hierarchical security model, the connections occurring 
with lower security levels would ideally be visible, but valuable connections that exist to 
a higher level would not be seen by a user at the lower level. Without these cross security 
level connections accessible in some way, using a graph database of combined security 
levels provides no data analysis advantage over keeping the compartments on physically 
separate systems. 

The most straightforward way to handle the issue of edges would be to individually classify 
all of them in the same way that the nodes are classified. While this method is likely the 
most secure, because each edge must be individually assigned a security level, it is also 
the most cumbersome and the most likely to lead to the least optimal data organization for 
observing connections between data points. Individually classifying relationships is a good 
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option to have for handling specific pieces of information, but it is not the best route for a 
general data disclosure policy for the relationships between nodes of the database. 

Three main types of relationships between nodes of different levels exist: trivial, restricted, 
and induced. Same and different are relative terms that designate whether the two nodes 
at the ends of a relationship are of the same or different security levels. These relationship 
types will be introduced here and a graphical analysis of the most optimal policy for han¬ 
dling these relationships will be discussed in Chapter 5. The nodes for illustrating these 
relationship types will be labeled "X" and "Y" where "X" is the security of the current user 
and "Y" is some other security level. 

4.2.1 Trivial Relationships 

A relationship from a given node to another 
node of the same classification level, as il¬ 
lustrated in Figure 4.4, is the trivial case for 
this exercise. The two nodes would both be 
visible to a user at level X, and thus the re¬ 
lationship between them would be returned 
for any query that returns those two nodes. 

4.2.2 Restricted Relationships 

A relationship from a node at one security 
level to a node at a different security level, as illustrated in Figure 4.5, would not be seen by 
a user operating at either of the security levels. A user at level X would only see the X node, 
and a user at level Y would only see the Y node; neither one would see the opposite node nor 
the presence of a connecting relationship. Because Neo4j is a directed graph, the restricted 
relationships can be either an outbound relationship, as illustrated in Figure 4.5(a), or an 
inbound relationship, as illustrated in Figure 4.5(b). These two cases, although they are 
directionally opposite, will be addressed as the same type. 

With this relationship, it is unclear what the classification is of either the existence of the re¬ 
lationship or of the content of the relationship. Whether or not the X- or Y-level user should 



X -K X 



Figure 4.4: Trivial relationship. 
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be able to have access to the relationship could also be dependent on if the relationship is an 
inbound relationship to the user’s security level or outbound from the user’s security level. 
Even though it restricts the visibility of the graph’s connectivity, the recommended policy 
for this type of relationship is to only allow it to be visible to a user if that user has access 
to both the X and Y security levels. The graph of Figure 4.1 contains examples of each 
relationship type. For a user at level X, only the X node should be visible with granular 
security. The Y-level node should only be visible to a Y-level user and the relationships 
between the X- and Y-level nodes should only be visible to a user who as permission to 
access both levels. 



(a) Outbound (b) Inbound 

Figure 4.5: Restricted relationship. 


4.2.3 Induced Relationships 

The induced relationship is a set of two re¬ 
lationships. The first relationship is a sim¬ 
ple restricted relationship that begins in the 
user’s security level. The second relation¬ 
ship is another restricted relationship that 
begins at the terminal node of the first rela¬ 
tionship and points back to a node in the 
user’s security level creating a two-edge 
path that starts and ends in the same secu¬ 
rity level. 

This relationship type is illustrated in Figure 4.6. In this relationship a connection exists 
between two nodes of the same classification level that a user at that classification level 
would not be able see. Chapter 5 will explore this relationship type using graph theory 
methods to find a mathematical justification for showing the induced relationship (marked 
with a ’?’) that exists between the two X-level nodes. 



Figure 4.6: Induced relationship. 
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More complicated relationship types exist that expand upon the basic induced relationship 
type. These types will be briefly addressed in Chapter 7, but they remain largely outside 
the scope of the present work. 


4.3 Possible Relationship Security Policy 

A potential security policy for keeping data from being disclosed to unauthorized users 
while leveraging the connectedness of information within a graph database requires a 
thoughtful balancing act. In all cases of a security policy implemented directly on the 
nodes of a graph, a user at a given security level would be able to access all the nodes at 
the corresponding level (i.e., a Y-level user would be able to access all Y-level nodes). 

In the case of trivial relationships, a security policy should allow a user at a given security 
level to access the relationships that both originate and terminate at the user’s access level. 
Restricted relationships, regardless of whether they originate or terminate in the user’s 
security level, should only be accessible to a user if she has access to both security com¬ 
partments within the database. Because the relationships cannot exist without the nodes, 
having a relationship that points to nowhere does not fall within the typical parameters of 
a graph database, and knowing that a relationships exists without knowing the other end of 
the connection is not likely to provide much additional information. 

For induced relationships that cross to another security level and then back to the user’s 
security level the relational elements that graph databases provide can be leveraged by 
giving the user access to the induced relationship that exists between the two nodes. This 
induced relationship allows the user to see the presence of the data connection without also 
having access to the information in a security level to which she does not have access. 
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CHAPTER 5: 

Security Exploration on Graph Edges 


Understanding the nature of security on the edges in a graph database requires exploration 
of the possibilities of graph representation. Organizing the same graph using different 
network models can reveal characteristics of the connections that become visible in the 
different models. 

The goal of the present chapter is to use the construction models of graph theory to find a 
mathematical justification for generating the edges of induced relationships within a large 
graph database. On a small graph, finding those relationships is easy to accomplish vi¬ 
sually; however, on a large data set, finding the induced relationships would need to be 
accomplished via a sound mathematical algorithm. 

Using a simple manipulation graph from Figure 4.1, the different ways in which that same 
graph can be represented will be illustrated and evaluated for its applicability to producing 
the induced relationships as well as the overall impact to the graph database. The discussion 
will begin with ruling out hypergraphs as well as weighted and temporal graphs. Colored 
graphs, although germane to the task, will next be shown to provide no additional insight 
into the problem at hand. The chapter will conclude with a look at the bipartite feature 
multislice graphs to produce a mathematical process for finding induced relationships. 


5.1 Hypergraph 

A hypergraph is comprised of nodes and edges; however, the edges are not limited to con¬ 
necting a maximum of two nodes. Edges in hypergraphs, called hyperedges, can pass 
through any number of nodes between the two end nodes [10]. For a database in Neo4j, 
hyperedges would replace the directed relationships and their content. Instead of a relation¬ 
ship that contains information, each edge would become a directed hyperedge that would 
have the same beginning and ending nodes as the regular directed edge, but would also pass 
through a fabricated relationship-node. These hyperedges are implemented on the simple 
manipulation graph in Figure 5.1 with each of the new nodes labeled he. 
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This construction with hyperedges leaves a graph database that has many more nodes, but 
with relationships that do not store any eontent. Classifying the eontent of the relationships 
beeomes a moot issue sinee that data is now stored in nodes, whieh must reeeive a seeurity 
designation. 



Figure 5.1: Simple manipulation graph with hyperedges. 

Creating these additional nodes does not address the issue of how the database should ad¬ 
dress the restrieted and indueed relationships that eross between seeurity levels. In addition, 
ereating more nodes with independent elassifieation levels eould inerease the prevalenee of 
these ambiguous relationships. While a hypergraph eonstruet allows for an easy way to 
direetly elassify the eontent of the edges, it does not aid in addressing the issue of how to 
handle the more eomplieated, eross-eompartment seeurity relationships. 

5.2 Weighted Graph 

A graph with weights assigned to eaeh edge is called a weighted graph. In these graphs 
traversing from one node to another along an edge is assoeiated with a eost; the distanee 
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between cities is an example of a weight that might be placed on an edge. For example, the 
distance from New York to Los Angles is about 2,500 miles while the distance from New 
York to Philadelphia is only 100 miles. Thus, the weight of an edge connecting the New 
York and Los Angles nodes would be 2,500, and the weight of an edge connecting the New 
York and Philadelphia nodes would be 100. The edges could be weighted via any metric 
appropriate to purpose of the graph; travel time, fuel consumption, or number of rest stops 
are examples of other weight options. 



Figure 5.2: Simple manipulation graph with edge weights. Relationships are given weights 
according to the following-X to X: 1, Y to Y: 4, Z to Z: 7, X to Y: 3, Y to Z: 5, X to Z: 6. 


In a graph database with cell level security, the weights could serve as a way to regulate 
the edges a user can access. If a user is assigned a path traversal budget that is applied 
independently, not cumulatively, to each path (either in addition to or instead of a security 
clearance level), then he would be able to access any paths that are less than or equal to 
that maximum budget. The simple manipulation graph with weights applied is found in 
Figure 5.2. 
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In a weighted graph database, the user’s security level along with his traversal budget would 
determine what information was accessible. A user at the Y-level with a traversal budget of 
4 would be able to see all the trivial relationships at the Y- level, because they each weigh 4, 
as well as be able to gain information about relationships that extend down to the X-level, 
because they each weigh 1, but not relationships that extend to the Z-level, because they 
each weigh 5. He would also not be able to see any induced relationships to the X-level 
because each hop on the path to the unclassified level and back weighs 3 for a total induced 
cost of 6, more than the user’s traversal budget. 

If that same user were given a travel budget of 6, he would be able to see the relationships 
up to the Z-level as well as the induced relationships to the X-level. If the traversal budget 
is used with the security level, the nodes outside of the user’s security level would still not 
be accessible. 

Displaying the relationships that are anything other than the trivial type would entail some¬ 
how conveying a relationship without all the end nodes. The most straightforward way to 
display relationships in this way would be to make them hyperedges, while doing so does 
solve the problem of displaying one-sided relationships, it does not allow the user to ac¬ 
curately see induced connections. This solution ultimately leaves the user with the same 
limitations that exist with using a hypergraph. 

Using a weighted graph model without also classifying the nodes is another option imple¬ 
menting granular security. In this construct a user would still be given a traversal budget 
that would, in this case, not be restricted to the visibility of the nodes. Only assigning a 
weighted traversal security to the nodes, however, would only be applicable to hierarchical 
security environments; a traversal budget with a large value would be able to see all edges 
of lesser value. In addition, devising a scheme to appropriately assigned edge weights and 
traversal budgets to ensure proper disclosure to all users would require a difficult algorithm 
with high potential for giving improper access to data. 


5.3 Temporal Network 

A temporal network is similar to a weighted graph where the weights instead operate sim¬ 
ilar to timestamps. In this graph type, the presence of an edge depends on when along 
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a timeline the graph is examined [11]. Normally this type of graph is employed to show 
how the eonneetions in a graph ehange over time. But, if windows of time are instead 
seen as seeurity levels, the eoneept eould potentially translate to allow aeeess aeeording to 
the values of the weighted edges. For instanee, supposing that the weights of Figure 5.2 
are timestamps, the graph at timestamp 4 would show only the trivial relationships at the 
X-level. 

Timestamps ean also be assigned for periods of time. If a user were given aeeess to all 
timestamps 4 and less, then the relationships at the X and Y levels would be visible. Simi¬ 
larly a timestamp aeeess from 3 to 5 would give a user aeeess to all relationships eonneeting 
to and/or from the X-level. Although the timestamps appear to be relegated to a hierarehi- 
eal system, they eould be implemented in a way that makes eompartmentalization possible. 
User timestamp aeeess eould also be disjoint; a user with aeeess to timestamps 1, 6, and 7 
would be able to see the Z and X-levels as well as the eonneetions between them. 

The temporal framework provides an exeellent model for examining edge seeurity, but the 
timestamps only aet as a proxy for assigning edge seeurity direetly. In a seeurity environ¬ 
ment where this direet assignment is desirable the temporal model would provide insight 
into how to best eonduet that arrangement. The model, however, does not provide addi¬ 
tional means for leveraging the restrieted and indueed relationships. 


5.4 Coloring 

Although eoloring graphs has its foundations in eartography, it is easily applieable to nodes 
and edges [5]. Applying eolors to a graph ean help to bring out patterns and eonneetions 
that are not readily apparent in a graph without those elassifieations. Graph eoloring is typ- 
ieally employed for situations where two or more graphs may exist mostly independently, 
but those quasi-independent graphs have eonneetions tying them together [12]. The seg¬ 
regation of a graph database into separate seeurity lends itself to viewing those separate 
eompartments and independent graphs with eonneetions linking the data aeross seeurity 
levels. 
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5.4.1 Node Coloring 

Applying node coloring to a network requires assigning a single color to every node in 
the graph. Traditionally the coloring is applied in such a way that no two adjacent nodes 
share the same color; in the present work, however, the node coloring serves as a node 
label. A graph database with granular security applied is essentially already node colored. 
The security labels serve to separate the nodes into categories-the intention of coloring a 
network. Associating colors with the security levels, as can be seen on the simple graph in 
Figure 5.3 will give actual colors to the existing node labeled graph. 



Figure 5.3: Simple manipulation graph with node coloring. Colors are assigned according to the 
following-X:green, Y:red, Z:blue 


5.4.2 Edge Coloring 

Similar to node coloring, edge coloring is applying a single color to every edge in a net¬ 
work; in the same way, the edge colors serve as labels and, like the node colors, are not 
applied in the typical non-adjacent fashion. The way edges are classified and colored can 
vary. 
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In a compartmentalized graph, they ean be eolored based on the security levels where 
they originate, where they terminate or both. They can be colored according to whether 
or not they cross a security level. In the graph database where relationships are given 
types, the edges eould be eolored based on those labels. Given that the sample graph does 
not have relationship labels, the simple graph in Figure 5.4 is edge-colored based on the 
elassification of the node where the edge originates. 



Figure 5.4: Simple manipulation graph with edge coloring. Colors are assigned based on the 
classification of the originating node according to the following-X:green, Y:red, Z:blue 


5.4.3 Combined Coloring 

Use of simultaneous edge and node eoloring on the simple graph can be seen in Figure 5.5. 
These eoloring schemes do not aetually provide any new information. While the ehromatic 
highlights bring visual attention to the features of the graph, their assignment is based 
solely on labels or attributes that were already present in the graph. 

In Figure 5.5 the number of nodes at eaeh elassification level can be easily aseertained; 
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in addition, the connections that link nodes of different security levels are also apparent. 
The limitation is that the classification of the nodes and the attributes of the edges are 
not any different now than they were before the colors were applied. Any graph database 
large enough to be useful will be much too large to examine visually, making the color 
assignment an aesthetically pleasing feature, but one with no practical utility. 



Figure 5.5: Simple manipulation graph with node and edge coloring. 


5.5 Multilayer Graphs 

The idea of a node colored graph, distinguishing nodes based on their labels, can be applied 
in an alternate way yielding a graph that can be analyzed differently. Instead of distinguish¬ 
ing nodes of the same security level by assigning a color, grouping the nodes of the same 
level all to a single layer reveals where relationships cross from one security level to an¬ 
other in a way that can be utilized for analysis. Figure 5.6 shows the simple graph rendered 
as a multilayer graph with each node on its appropriate classification layer. 


28 





5.5.1 Multiplex Graphs 

Studies of multilayer networks frequently render graphs as multiplex networks, a subgroup 
of multilayer networks [13], [14]. Multiplex networks are a speeifie type of multilayer 
network where every node has a eounterpart in every layer of the network [13]. 



Figure 5.6: Simple manipulation graph as multilayer graph. Edges that remain originate and 
terminate on the same layer are solid lines; those that cross layers are dotted lines. The di¬ 
rectionality of the network is omitted in this image due to the limitations of the visualization 
software. 


This replication is sometimes done for simplicity, but it is usually used to represent con¬ 
structs such as social networks where one node will have a presence in more than one 
layer. For example, a single social media user may have an account in Twitter, Facebook, 
and Instagram with different connections for each site. Those different services are then 
each represented as a single layer within a broader social network. A user with no Twitter 
account would just be an unconnected node in that layer. 
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Typically, in a multiplex network, connections between layers can only be made across the 
replicated nodes in each layer. The node for user A on the Facebook layer can only have 
intra-layer connections to user A on the Twitter or Instagram layers. This restriction re¬ 
flects the reality that user A’s Facebook account cannot connect directly to user B’s Twitter 
account. 

A security environment where the nodes are intentionally kept separate from each other 
does not lend itself to this common multiplex model. Copying a Z-level node into the X 
layer would defeat the purpose of having granular security for the database. While the 
copies of each node in multiple layers would allow for better access when relationships 
cross security levels, the security features would be rendered useless, and the result would 
ultimately be similar to producing multiple copies of one database. The more general idea 
of a multilayer graph is better suited to the needs of a secure database model. 


5.5.2 Multislice Graphs 

Applying slices to layers in a multilayer graph requires examining a separate layer or layers 
of the graph independently, allowing examination of subsets of edges based on those layers 
[15]. Multislice graphs are typically applied to multiplex graphs with all nodes replicated 
in every layer [12]; however, they can also be applied to generic multilayer graphs. 

For a compartmentalized graph database, slicing isolates the individual layers as well as 
the connections between any two layers. In Figure 5.7, six slices of the simple graph 
are displayed. Each individual security level has its own slice and an additional slice is 
constructed for each pair of security levels. The number of slices will grow in proportion 
to the number of security levels in the database according to: 

R|= l'^,.l(l^"l + l) (5,1) 

where |5„| is the number of slices required and |L„| is the number of layers (i.e., the number 
of security levels in the database). 

The individual slices pull out the trivial relationships, which does not add much to the 
ability to analyze the whole graph. The slices that contain pairs of security layers, on the 
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other hand, isolate the restrieted relationships apart from the edges that do not eross seeurity 
levels. In Figure 5.7(d), the induced relationships can be clearly seen. A connection exists 
from node XI to node X2 and from node Y1 to node Y3. In the case of the former edge, a 
parallel relationship already exists in the graph, but the database can accommodate multiple 
relationships between the same pair of nodes. In the latter case, the relationship between 
the two nodes is not readily visible before examining the slice. 





(a) Unclassified Slice 


(b) Secret Slice 


(c) Top Secret Slice 





(d) U to S Slice 


(e) U to T Slice 


(f) S to T Slice 


Figure 5.7: Simple manipulation graph in multislice format, (a), (b), and (c) each show a slice 
with one security level and all the trivial connections at that level, (d), (e), and (f) each show 
a slice with a pair of security levels. The dotted lines represent connections between levels, 
and the dashed lines in (d) represent induced connections across the two security levels. The 
directionality of the network is omitted in this image due to the limitations of the visualization 
software. 


In a large graph, visually finding the induced relationships is not a feasible task. Mathe¬ 
matically finding those inferred relationships, however, is a possibility using the slices that 
contain the relationships between the security levels. 


5.6 Bipartite Graphs 

Each of the paired slices makes up a subgraph that is bipartite. Bipartite graphs are graphs 
where the nodes can be split into two groups (the security levels in this case), and edges 
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connect nodes from one group to another with no edges running between two nodes in 
the same group [5]. Because the trivial relationships are not contained in the paired slices 
the only relationships present connect nodes from one security level to another making the 
paired slices bipartite graphs. 

The bipartite graphs that contain only the edges that connect two layers (Figures 5.7(d), 
5.7(e), and 5.7(f)) can be used to mathematically locate the edges that exist behind any 
induced relationships. Using multiplication of the adjacency matrices of those bipartite 
graphs results in a new adjacency matrix for the induced edges. 


5.6.1 Adjacency Matrices 

An adjacency matrix is a matrix representation of a graph. The columns and row indices 
of the matrix represent the nodes of the graph, so a matrix with n nodes would have an 
nxn adjacency matrix. In an undirected graph, there will be two ones in the matrix for 
every edge and zeros for all other locations. For an undirected graph with an edge (u,v) the 
associated adjacency matrix will have a 1 at row u, column v and a 1 at row v column u. 

If the simple graph were undirected, then the associated adjacency matrix would be repre¬ 
sented as: 


XI 

X2 

X3 

71 

72 

73 
Z1 
Z2 
Z3 


XI 

( ^ 

1 

1 

1 

0 

1 

0 

0 

V 1 


X2 X3 

1 1 

0 0 
0 0 
0 0 
0 0 

1 0 
1 0 

0 0 
0 0 


71 

1 

0 

0 

0 

0 

0 

0 

1 

0 


72 73 

0 1 
0 1 

0 0 
0 0 

0 1 
1 0 

0 0 
0 0 

1 0 


Z1 Z2 Z3 
0 0 1 \ 

10 0 
0 0 0 

0 10 
0 0 1 

0 0 0 

oil 
10 0 
1 0 0 / 


(5.2) 


The 1 at the row for node XI and the column for X2 represents the edge from XI to X2 
and has a corresponding 1 at the row for X2 and the column for XL The simple graph. 
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however, is a direeted graph and the edges do not flow in both direetions. 


5.6.2 Directed Adjacency Matrices 

For a direeted graph, two methods exist for representing the assoeiated direeted adjaeeney 
matrix. One way involves using ones to represent an edge that is direeted from the row 
node to the eolumn node and a negative one to represent the edge at the inverse row and 
columns of the matrix. In this method the (XI, X2) value would be one because the edge is 
directed from XI to X2 and the (X2, XI) location would be a negative one. 

The other method for representing directed adjacency matrix is to use ones for the direction 
of the edge and to leave the inverse location as a zero. Using this representation the directed 
adjacency matrix for the simple graph becomes: 


XI 

X2 

X3 

71 

72 

73 
Z1 
Z2 
Z3 


XI 
/ 0 
0 
0 
1 
0 
0 
0 
0 

V 0 


X2 X3 71 
110 
0 0 0 
0 0 0 
0 0 0 
0 0 0 
10 0 
10 0 
0 0 0 
0 0 0 


72 73 Z1 Z2 Z3 

0 1 0 0 1 \ 

0 0 0 0 0 

0 0 0 0 0 

0 0 0 1 0 

0 0 0 0 0 

1 0 0 0 0 

0 0 0 0 0 

0 0 10 0 
10100 / 


(5.3) 


For an undirected bipartite graph, the partial adjacency matrix, where one group is the 
rows and the other group the columns, the matrix can be multiplied by its transposition, the 
opposite matrix where the groups switch rows and columns, to return a third matrix that 
shows the instances where a node is connected to another node in the same group with a 
node from the other group between them [16]. 


Using the undirected matrix for the Unclassified and Secret slices yields the matrix multi¬ 
plication: 
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where the entries along the diagonal represent the degree of each node (XI is degree, X2 is 
degree 1, and X3 is degree 0, which agrees with Figure 5.7(d) and the other ones represent 
the induced relationships from the bipartite graph on the Unclassified layer, namely the 
relationship from XI to X2. Reversing the order of the multiplication yields: 
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which reveals the degree of the nodes on the Y-layer and the induced relationship between 
nodes Y1 and Y3. 

Splitting the directed adjacency matrix according to the security level paired slices of 
Section 5.5.2 gives adjacency matrices for those directed bipartite subgraphs. The graph 
database and the simple graph, however, are both directed graphs. When performing the 
same multiplicative process on the directed adjacency matrices, the second matrix is not 
the transposition of the first, but rather the adjacency matrix for the edges pointing to the 
opposite group. The directed adjacency matrix multiplication for both directions yields: 
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Equations (5.6) and (5.7) show the directed edges that were visually added to Figure 5.7(d). 
This matrix multiplication process can be performed over a large graph where visual in¬ 
spection cannot. Separating the larger graph into two-layer slices allows for the isolation 
of the edges that cross from one security level to another. The multiplication of the re¬ 
sulting bipartite adjacency matrices mathematically generates the edges that result from an 
induced relationship. 
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CHAPTER 6: 

Granular Security Implementation 


Implementing granular security in Neo4j would require adding a protocol to the database 
that would take a user’s query and translate that input into a secure query. The database 
system would need to add elements to the user query that would only return results that 
align with the security authorizations of the user. 


6.1 Granular Security through Properties 

Implementation of granular security in Neo4j can be done through the use of properties 
within the nodes and relationships. Using a reserved word (e.g., security) as a property 
with the appropriately assigned values implements granular security on the database nodes. 

A MATCH or CREATE command could perform a hidden operation, using the same 
Cypher format, to match or create nodes with a security property holding a value equivalent 
to the authorization found for the user in the auth file. A user’s MATCH command could 
be appended with an additional security property after the user enters the command. The 
initial command that would return all nodes would have the property added to it before it 
was actually queried against the database: 

MATCH (a {security: "Y"}-) 

RETURN a 


Instead of returning all nodes, this query would return only the nodes that matched the "Y" 
security property. The addition of a hidden security property, which the user cannot modify 
directly, allows for the implementation of granular security. 


6.2 Proof of Concept 

Neo4j, in addition to distributing the community version via free download, also hosts their 
source code on GitHub and invites users to contribute to the development of the software. 
This Neo4j source code file is designed to easily import the addition of user-written plugins 
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that can expand on the query eapabilities of Cypher. 


Writing and using plugins eomes with several limitations. Beeause they are not embedded 
in the Cypher eompiler, they eannot be run from the Graphieal User Interfaee (GUI) as a 
normal query would. Plugins must be exeeuted either through eommand line interfaee with 
Neo4j or though the GUI eonsole in Hypertext Transfer Protoeol (HTTP) mode. A plugin 
also does not have the neeessary file permissions to aeeess user data where information 
about a user’s authorization tokens would be stored. 

The plugins are simply additions to the Neo4j eommand library, thus, they eannot replaee 
existing Cypher eommands. A new MATCH eommand implemented via plugin that re¬ 
quires a seeurity level to exeeute eould simply be avoided by using the standard MATCH 
command of Cypher. The plugin to implement a mateh query serves as a proof of eon- 
eept that the granular seeurity eould work as antieipated rather than serving as a seeure, 
funetioning aeeess eontrol tool. 


6.3 Plugin 

Plugins ean be written to exeeute queries that simply perform speeifie Cypher queries 
through the HTTP eonsole. The plugins are either written fully in Java or use Java to 
exeeute a Cypher command. The AltMatch plugin, written to implement granular seeurity 
on Neo4j nodes, performs MATCH and RETURN operations on all nodes in the database 
with the seeurity elearanee level speeified by the user. 

From the HTTP eonsole the AltMateh eommand ean be ealled with a GET request: 

GET /altmatch/<clearance> 

The strueture of the eommand depends on how the plugin file for the plugin is paekaged; 
in this ease the "altmateh" portion of the GET request instruets the eonsole to use the 
AltMateh plugin file. The "<elearanee>" of the GET request allows the user to speeify the 
seeurity level of the nodes they query should return. 

Using the artifieial seeurity labels for illustration, a query to seareh for all Y-level nodes 
would take the form: 
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GET /altmatch/Y 


This command would return the data for each Y-level node in the database. Supposing that 
the fox and rabbit nodes from Chapter 3 have a security property of "Y", the output from 
the above query would be: 

==>200 OK 

==> {"matches": ["f ox", "rabbit"] }■ 

The deer node, which has a different classification, would remain hidden. Because the Alt- 
Match plugin returns the results of a specific query, returning more information about the 
two nodes would require a separate plugin that would return the desired information. All 
separate queries would require separate plugins to run with the security level requirement 
making the plugin functionality not a realistic option for implementing granular security. 


6.4 Cypher GUI 

When called through the HTTP console, the AltMatch plugin runs the Cypher query: 
MATCH (p {security: <clearance>}-) RETURN p 

Using the same animal node database, that same query run directly in the Cypher GUI 
returns the graph shown in Figure 6.1. The query for the plugin run in the database GUI 
returns not only the names of the nodes but also the full set of properties for each node, 
which can be viewed by selecting the node in the graph view or by switching to row view. 
Displaying this additional data about the nodes is consistent with the intentions of granular 
security, since all the data for a given node is classified at the same security level as the 
node itself. 

In addition to the two nodes at the Y-level, the Neo4j GUI displays the induced subgraph 
for the two nodes. An induced subgraph contains the set of all edges (u,v) for every u and 
V that are a part of the subgraph [5]. In a graph database, for every pair of nodes displayed, 
a query also returns any relationships that connect those two nodes if such an edge exists 
in the database. With Neo4j these relationships contain labels and properties that are also 
returned with the query. Generating a sound security policy to return the appropriate rela- 
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tionships for the induced subgraph for the results of a query at a given security level is the 
complicated aspect of implementing granular security in a graph database. 


EATS 

<id>: 0 name: fox security: Y color: red habitat: forest legs: 4 

Figure 6.1: Results from Neo4j GUI query with granular security. 

Neo4j can also return results to a query on only relationships, such as: 

MATCH 0 -{r:EATS>-> () RETURN r 

In this case, the graph returned would be a subgraph of the relationships returned as well as 
the nodes at either end of those relationships (i.e., the same graph returned in Figure 6.1). In 
addition, Neo4j would return an expanded induced subgraph of those nodes by including 
any additional relationships between those nodes even if they were not requested in the 
original query. 

If another type of relationship existed between fox and rabbit, that relationship would also 
be returned. If the fox also had an eats relationship with the deer node then the deer node 
would be returned with the query along with the playswith relationships between the rabbit 
and deer nodes (even though that relationship was not a part of the original query). Proper 
direct security classification of the nodes would prevent the nodes from being returned to 
an unauthorized user, and if the node is not returned then a relationship to that node cannot 
be returned at all. 




6.5 Synthetic Graph Analysis 

Generating and verifying the connections from induced relationships on a graph of nine 
nodes is a simple task to perform visually. The same undertaking on a larger graph can¬ 
not reasonably be performed by hand, and instead must be calculated via computer. The 
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induced relationship creation was performed on a randomly generated graph both within 
Cypher as well as via matrix multiplication on the random graph’s adjacency matrix. 

This random graph was generated [17] using an Erdos-Renyi [18] model where the number 
of nodes was set to 300 and the probability that any given edge exists is 0.02, creating 896 
total edges in this instance. To convert the generated graph into a directed graph, each edge 
was randomly assigned directionality with probability of 0.5 for either direction. 

To simulate an actual security environment with this random graph, each node was assigned 
a security level according to the DOD classification system: Unclassified, Secret, and Top 
Secret. Since the graph was randomly generated, the security assignments were made via 
numbered groups (i.e., nodes 1-100: Unclassified, nodes 101-200: Secret, nodes 201-300: 
Top Secret). 

In the hierarchical security environment of the DOD, a user at a higher level is assumed 
to have access to information at lower levels (with the exception of any additionally com¬ 
partmentalized information as discussed in Chapter 2). The hierarchical structure does not 
fundamentally alter the application or the solution. 

An authorized user has access to some subset of the entire database according to their ac¬ 
cess level. Induced relationships can be created between the security layers that the user 
can access and any other layers in the database. In the DOD hierarchical environment, a 
Secret level user, for instance, also has access to Unclassified information; induced rela¬ 
tionships can then be created between the Secret and Top Secret layers as well as between 
the Unclassified and Top Secret layers. Induced relationships could also be created be¬ 
tween the Unclassified and Secret layers, but since the user has access to both layers no 
new connections would be identified. 

The directed graph with security properties that resulted from the random graph generation 
was loaded into Neo4j. The database was queried to find the induced relationships that ex¬ 
isted between each pair of security levels using the following query where <user’s security 
level> and <other security level> are replaced by "Unclassified", "Secret", or "Top Secret" 
(six queries total, one for each possible pair of security levels): 
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MATCH (n {security: <user’s security level>}-) -[]-> 

(m {security: <other security level>}-) -[]-> 

(p {security: <users’s security level>}-) 

RETURN n, m, p; 

In Neo4j the query found 107 relationships that traversed from an Unelassified node to a 
Top Seeret node and baek to an Unelassified node. 

The Unelassified by Top Seeret and Top Seeret by Unelassified adjaeeney matriees for the 
graph were multiplied as in Equations (5.6) and (5.7). That multiplieation returned a matrix 
of 107 indueed edges between Top Seeret nodes. This same step was performed for every 
layer pairing in both direetions (6 parings total). The Cypher queries locating the induced 
relationships and the matrix multiplication of the bipartite graphs of each security layer 
pair agreed in every instance how many induced edges existed. 

The induced edges for the Unclassified layer were added using the following Cypher com¬ 
mand (plus the corollary for the induced edges with the Secret level): 

MATCH (n {security: "Unclassified"}-) - []-> 

(m {security: "Top Secret"}-) -[]-> 

(p {security: "Unclassified"}-) 

CREATE (n) -[r:INDUCED]-> (p); 

These two commands added 201 relationships to the Unclassified layer (107 with the Top 
Secret layer and 94 with the Secret layer) increasing the number of relationships in that 
layer from 101 to 302. The graphs for the Unclassified layer can be seen in Figure 6.2 both 
with and without the induced edges. 

The graph of the Unclassified layer in Figure 6.2(b) is much more connected than its coun¬ 
terpart without the induced relationships. Adding the induced relationships reduced the 
number of connected components of the graph from 17 down to one, meaning that all the 
nodes in the induced relationship graph are connected to each other vice some nodes being 
unreachable in the original graph. 

In addition, the clustering coefficient, a measure ranging from zero to one of how connected 
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a graph is based on the number of triangles formed by the edges [10], of the original Top 
Secret layer is zero. The graph with induced relationships, on the other hand, has a cluster¬ 
ing coefficient of 0.029, showing that the connections between the nodes is much higher if 
the user is able to look for those connections outside of a single layer. 
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(a) Onclassified layer. 



(b) Onclassified layer with induced relation¬ 
ships. 


Figure 6.2: Unclassified layer as rendered in Neo4j. (a) has only the original trivial relation¬ 
ships that exist in the Unclassified layer, (b) contains all the original relationships with induced 
relationships added to the graph. 


When the induced edges are created across the entire graph, an additional 575 edges are 
produced (186 in the Top Secret layer and 188 in the Secret layer). The nodes in this 
random graph are much more connected than a user restricted to the relationships contained 
within a single level. Allowing the inclusion of induced relationships in the graph database 
allows the user better access to connections between data points while restricting him to 
his authorized security level. 


6.6 Real World Graph Analysis 

To better explore induced relationships, a real directed network was partitioned and exam¬ 
ined. The network used was a modified version of links between Google’s web pages [19]. 
The original data set contained 15,763 nodes and 171,206 relationships. 
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To make the data set more manageable, two large seetions were removed: international 
and university. These seetions were removed both beeause they were extremely large and 
easy to extraet and because they functioned as separate mini-Google environments that 
were highly connected within themselves, but not very connected across the full set of web 
pages. 

Additionally, a few comparatively high degree nodes were removed. Google’s home page, 
for instance, had over 10,000 connections while most pages possessed 200 or less. These 
high degree nodes would cause a large number of induced relationships; they were removed 
to make the degree distribution more equitable across all nodes. The resulting data set 
contained 1,125 nodes and 16,941 relationships. 

The web pages were instead split into three groups that were arbitrarily assigned classi¬ 
fication levels of Unclassified, Secret, and Top Secret. An attempt was made to use the 
Louvain method [20] to find natural groups within the data set, and to use those com¬ 
munities to simulate classification levels. The community detection algorithm, however, 
returned one large community and several much smaller ones that would not conform to 
the three-security-level model. 

Divisions were, instead, made alphabetically on the basis of the Uniform Resource Locator 
(URL) under the assumption that pages who’s URLs begin with "http://www.google.com/ 
support" are more likely to be connected to, and as a corollary be classified in the same 
level as, web pages that have the same URL beginning. The groups sizes were 411 nodes 
in the Top Secret level, 343 nodes in the Secret level, and 353 nodes in the Unclassified 
level. 

Searching for induced relationships on the Unclassified level in the Google web page 
database returns an additional 5,210 relationships (240 passing through the Top Secret 
level and 4,970 passing through the Secret level). Because the security levels were ar¬ 
bitrarily assigned, any one of them could be assigned the Unclassified security label. Thus, 
the induced relationships between all levels are equally relevant. The induced relationships 
for the other levels are: Top Secret-455 (141 with the Secret level and 455 with the Unclas¬ 
sified level) and Secret-1,824 (264 with the Top Secret level and 1,562 with the Unclassified 
level). 
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The density of a graph is a ratio of how many edges exist to the total number of possible 
edges for a given number of nodes. For a directed graph, this calculation includes the 
possibility for two distinct edges between every pair of nodes, one in each direction. The 
density of induced edges created can be computed according to: 


|v|(|y|-i) 


( 6 . 1 ) 


where l^l is the number of induced edges that were created, and |y| is the number of 
vertices in the security level. The density of the Top Secret layer is 0.0027. The density 
of the other two layers, with more induced edges is higher: Secret-0.016 and Unclassified- 
0.042. 


In all cases, the density of the layers increased with the addition of the induced relation¬ 
ships. The Unclassified layer density increased from 0.012 to 0.054. The Secret layer 
density increased from 0.016 to 0.031. And, the Top Secret layer density increased from 
0.058 to 0.061. 

By way of comparison, a synthetic graph was generated [17] using an Erdos-Renyi [18] 
model where the number of nodes and the number of edges are both specified. These 
values were set to match the real world graph (1,125 nodes and 16,941 edges). The nodes 
were then split into three equal groups to simulate the Top Secret, Secret, and Unclassified 
security levels. This configuration created approximately 19,00 induced relationships for 
each level producing an induced edge density of approximately 0.15 for each security level, 
an approximate density increase of 0.040 to 0.18 for each level. 

Although the density of induced relationships and the density increases are lower for the 
real network than for the synthetic graph, this outcome was expected due to the more 
Poisson degree distribution typical of synthetic graphs as opposed to the more power law 
degree distribution characteristic of real world networks. The higher number of moderately 
connected nodes in the synthetic graph would lead to more connections traversing security 
levels simply because a higher degree on a random graph increases the likelihood that a 
node at the other end of one of the relationships is in another security level. 

The densities for the real network increased by almost 100% in one case and by 450% 
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in another. A graph database with a granular security implementation that gives users 
access to induced relationships will show the connections between data points more than a 
completely isolated graph database, which allows no interaction between different security 
levels. 
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CHAPTER 7: 

Conclusion and Future Work 


This thesis examined the feasibility of implementing granular security onto a graph 
database looking specifically at retaining the ability of the graph to connect data points. 
Neo4j was used as the actual graph database for both providing a concrete basis for imple¬ 
mentation goals and as a testing ground for the functionality of the algorithms. 


7.1 Conclusions 

After introducing the current state of database models and the lack of granular security fea¬ 
tures on those models, the general security models and data management were discussed. 
The concept of granular security for a database was introduced. 

A discussion of graphs and graph databases then followed to give the reader enough work¬ 
ing knowledge of the topics to understand the subsequent research. The basics of granular 
security implementation on the database nodes was discussed next, followed by the intro¬ 
duction of the complication to implementation provided by relationships along with the 
three main types: trivial, restricted, and induced. A proof of concept for implementing 
granular security on Neo4j was also briefly demonstrated. 

Details of the possible methods of mathematical justification were then described and ana¬ 
lyzed appropriate. Graph theory concepts, which were determined as not helpful to the im¬ 
plementation of granular security include hypergraphs, weighted graphs, temporal graphs, 
colored graphs, and multiplex graphs. Multi-slice graphs and the adjacency matrices of the 
resulting bipartite graphs were used to demonstrate a mathematical justification for gener¬ 
ating induced relationships. 

The method for generating induced relationships as well as the correlated query in Neo4j 
were performed on a synthetic network and shown to produce equivalent results. The 
induced relationships generated in a single security layer of the synthetic graph were shown 
to better show the user the existing connections between nodes in the database. 
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The induced relationships were also generated on a real world network with artificial par¬ 
titions. The density of each layer of a graph showed marked increase with the addition 
of induced edges indicating that including those relationships in a granular security imple¬ 
mentation shows data connectivity better than simply partitioning security layers without 
including the induced relationships. 


7.2 Future Work 

The scope of this thesis is limited to provide an introduction to the proof of concept of 
implementing granular database security on a data set with security partitions. The full 
concept would need to be further explored in the following areas prior to drawing any 
complete conclusions about the implementation of granular security. 

7.2.1 Aggregating Information 

Related to Section 7.2.2 is the idea of critical information as it relates to Operations Security 
(OPSEC) where several pieces of information at one classification level can be combined 
to raise the classification level of the aggregated data. Determining a method for addressing 
when these combinations of critical information exist and induced relationships should not 
be drawn is crucial to implementing granular security. 

7.2.2 Longer Induced Relationships 

The induced relationships examined in the present work were simple two-hop paths that 
traversed to another security level and immediately back to the user’s security level. Paths 
that hop to another security level, then hop to another node in that same security level 
before hopping back to the user’s security level, as shown in Figure 7.1(a), are the most 
basic version of a longer induced relationship. 

These relationships can go beyond this basic form to a path that crosses into multiple other 
security levels before returning to the user’s security level, as shown in Figure 7.1(b). Ul¬ 
timately, the induced relationship construct must address the N-Hop induced relationship, 
illustrated in Figure 7.1(c), where a path departs the user’s security level and traverses an 
unknown path of unknown length before returning to the original security level. 
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The mathematical justification of finding these traversals is likely similar to the bipartite 
matrix multiplication as discussed in Chapter 5. However, the benefit from including all of 
these relationships versus the security of the database from including them must be further 
examined and understood. 



(a) 3-Hop Path 




Figure 7.1: Longer induced relationship examples. 


7.2.3 Hierarchy and Compartments 

This thesis examined the detection and production of induced relationships in a compart¬ 
mentalized structure and showed that the concept was trivially applicable to a hierarchical 
security environment. In the actual DOD security structure hierarchical and compartmen¬ 
talized models are combined; compartmentalized partitions exist within the hierarchical 
levels. The implementation of granular security in a graph database would need to be able 
to adequately address the combination of the two security structures. 

7.2.4 Testing on Secure Data Set 

Lastly, this thesis examined the proof of concept by implementing granular security on 
both a synthetic graph as well as a real world data set. Even the real world network, 
however, had the security levels artificially added. Thorough research on granular security 
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would need to examine how an actual classified data set would respond to the production 
of induced relationships taking into account all the areas for further exploration discussed 
in Section 7.2. 
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